-
Notifications
You must be signed in to change notification settings - Fork 224
Fix calculation of nested rep levels #1355
Fix calculation of nested rep levels #1355
Conversation
FWIW this is not correct, a repeated field is always required, with its corresponding definition level always indicating an empty list. You can then nest this within a nullable group to represent nulls. This is documented here. You may find this series I wrote on this helpful. You may also be able to crib from the parquet crates implementation of both the record shredding, schema inference and record delimiting necessary to make this use case work properly. I will forewarn you that this rabbit hole goes very deep 😅 |
Ah, of course. Thanks for the correction; that didn't seem right, but I couldn't figure out how to make it work. If I'm reading this correctly, you basically need two levels of nesting in a parquet schema for every (nullable) list in arrow. If the outer level is not defined, you have a null array. If the inner level is not defined, you have an empty array. Thanks for the linked resources as well. They're exactly what I was looking for. |
@AnIrishDuck thanks a lot for the PR! Unfortunately, as you wrote, I think this was not sufficient. I think this was addressed on #1390 I agree that dremel is a pain ^^ |
Yeah, I was always suspicious that my fix was incomplete, but it was the best I could come up with at the time. Glad to see it addressed in a deeper and more comprehensive way. Seems like I should update this to just include the previously-failing test and validate it works after #1390 ? |
Also, if we were to do the work, would some property testing of the parquet functionality in this crate be accepted? I'm thinking something that generates arbitrary arrow arrays, then verifies they round-trip through parquet back to the same array? |
@AnIrishDuck That would be awesome! Not sure if you are familiar with proptest, but one idea here is to implement Arbitrary to generate arbitrary arrays and use proptest to generate them. |
Yup, that's exactly what I was thinking. No promises, but we hopefully will have some time this quarter to look into this. |
@jorgecarleitao - I finally have carved out some time to look into this. Two organizational questions:
|
Also, I gravitate towards quickcheck instead of proptest as it's less macro-heavy. Any objections, other considerations, or other alternatives I should explore? |
See #1519 |
This one was fun. Porting back from WallarooLabs#7, with tests we can share upstream. I see there was a fix in here already, but it is not sufficient for our use case. I'm not sure if it clashes with our fix; with this both-fixes version, our internal tests pass along with the new unit test.
Parquet is a complex format, and I am not super familiar with this area of code. The solution here feels directionally correct, but incomplete. I have several concerns which I will elaborate on later.
First: the problem. A list array within a struct array was yielding incorrect and odd results.
I traced the root cause back to the rep level encoder in the writer. It was mistakenly adding rep levels for required values. A required value only has one valid repetition count: 1. Parquet is a space-conscious format and therefore nominally omits repetition counts when they are known. The
parquet2
library would thus calculate the max rep encoding length from parquet types and get one value. This max length would not match the max length emitted by the writer.So, if a struct required an inner array, the writer would treat values in that array as level-2 values instead of the appropriate level-1 value. The encoding width emitted by the rep level encoder in arrow2 would thus not match what parquet2 expects. The end result would basically just emit garbage when writing out rep levels, which meant the results read would have the correct underlying values with apparently random breaks between nested lists.
The fix is to omit rep levels for any required fields in the nesting stack. The resulting rep level width then matches what parquet2 expects, and we get the correct nested values.
There are a few further concerns:
num_values
fromrep.rs
throughout the write code. This feels like an overloaded concept, as there are many cases (required values) where values are present but a rep level tape out is unnecessary.